Results 1 - 20 of 20
1.
PLoS One ; 18(4): e0285212, 2023.
Article in English | MEDLINE | ID: covidwho-2294898

ABSTRACT

Recently, big data and its applications have seen sharp growth in various fields such as IoT, bioinformatics, eCommerce, and social media. The huge volume of data poses enormous challenges to the architecture, infrastructure, and computing capacity of IT systems, so the scientific and industrial communities have a compelling need for large-scale, robust computing systems. Since value is one of the characteristics of big data, data should be published so that analysts can extract useful patterns from it. However, data publishing may lead to the disclosure of individuals' private information. Among modern parallel computing platforms, Apache Spark is a fast, in-memory computing framework for large-scale data processing that provides high scalability by introducing resilient distributed datasets (RDDs). Thanks to in-memory computation, it can be up to 100 times faster than Hadoop. Apache Spark is therefore one of the essential frameworks for implementing distributed methods for privacy-preserving big data publishing (PPBDP). This paper uses the RDD programming model of Apache Spark to propose an efficient parallel implementation of a new computing model for big data anonymization. The computing model uses three phases of in-memory computation to address the runtime, scalability, and performance of large-scale data anonymization. It supports partition-based data clustering algorithms that preserve the ℓ-diversity privacy model using transformations and actions on RDDs. Accordingly, the authors investigated a Spark-based implementation for preserving ℓ-diversity with two designed distance functions, City block and Pearson. The results of the paper provide a comprehensive guideline allowing researchers to apply Apache Spark in their own research.
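For orientation only, the following is a minimal PySpark sketch of the kind of RDD pipeline such a model implies: records are assigned to partition seeds with a City block (Manhattan) distance via transformations, and an action returns only clusters whose sensitive values satisfy an assumed ℓ-diversity threshold. The data, seeds, and threshold are illustrative assumptions, not the paper's implementation.

# Minimal sketch (not the paper's implementation): group records around seed
# centroids using a City block (Manhattan) distance, then keep only groups
# that satisfy l-diversity on the sensitive attribute. Names, seeds, and the
# threshold are illustrative assumptions.
from pyspark import SparkContext

L_DIVERSITY = 3  # assumed diversity threshold

def city_block(a, b):
    """Manhattan distance between two numeric quasi-identifier vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_seed(record, seeds):
    """Assign a record to the closest seed (cluster id)."""
    qi, _sensitive = record
    return min(range(len(seeds)), key=lambda i: city_block(qi, seeds[i]))

if __name__ == "__main__":
    sc = SparkContext(appName="l-diversity-sketch")

    # (quasi-identifiers, sensitive value) pairs; toy data for illustration.
    records = sc.parallelize([
        ((34, 71000), "flu"), ((36, 69000), "cancer"), ((35, 70500), "asthma"),
        ((52, 40000), "flu"), ((51, 41000), "flu"),    ((53, 39500), "cancer"),
    ])

    seeds = [(35, 70000), (52, 40000)]  # assumed partition seeds

    clusters = (records
                .map(lambda r: (nearest_seed(r, seeds), r))  # transformation
                .groupByKey()                                # transformation
                .mapValues(list))                            # transformation

    # filter() is a further transformation; collect() is the action that
    # triggers execution and returns only l-diverse clusters to the driver.
    diverse = clusters.filter(
        lambda kv: len({s for _, s in kv[1]}) >= L_DIVERSITY).collect()

    for cid, members in diverse:
        print(cid, members)

    sc.stop()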


Subject(s)
Big Data , Software , Humans , Data Anonymization , Algorithms , Computational Biology
2.
Sci Data ; 9(1): 776, 2022 12 21.
Article in English | MEDLINE | ID: covidwho-2185972

ABSTRACT

Anonymization has the potential to foster the sharing of medical data. State-of-the-art methods use mathematical models to modify data to reduce privacy risks. However, the degree of protection must be balanced against the impact on statistical properties. We studied an extreme case of this trade-off: the statistical validity of an open medical dataset based on the German National Pandemic Cohort Network (NAPKON), which was prepared for publication using a strong anonymization procedure. Descriptive statistics and results of regression analyses were compared before and after anonymization of multiple variants of the original dataset. Despite significant differences in value distributions, the statistical bias was found to be small in all cases. In the regression analyses, the median absolute deviations of the estimated adjusted odds ratios for different sample sizes ranged from 0.01 [minimum = 0, maximum = 0.58] to 0.52 [minimum = 0.25, maximum = 0.91]. A disproportionate impact on the statistical properties of data is a common argument against the use of anonymization. Our analysis demonstrates that anonymization can actually preserve the validity of statistical results in relatively low-dimensional data.
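As a hedged illustration of the kind of comparison reported here, the short Python sketch below fits the same logistic regression on an original dataset and on a crudely coarsened copy standing in for an anonymized variant, then summarizes the absolute deviations of the adjusted odds ratios. The data, column names, and the coarsening step are assumptions, not the NAPKON procedure.

# Illustrative sketch (not the NAPKON pipeline): fit the same logistic
# regression on an original and an "anonymized" dataset and summarize how far
# the adjusted odds ratios drift. All names and data are assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def adjusted_odds_ratios(df, outcome, covariates):
    """Fit a logistic regression and return exp(coefficient) per covariate."""
    X = sm.add_constant(df[covariates].astype(float))
    model = sm.Logit(df[outcome].astype(float), X).fit(disp=False)
    return np.exp(model.params[covariates])

rng = np.random.default_rng(0)
n = 2000
original = pd.DataFrame({
    "age": rng.normal(55, 15, n),
    "male": rng.integers(0, 2, n),
})
logit = -4 + 0.05 * original["age"] + 0.4 * original["male"]
original["severe"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Crude stand-in for anonymization: coarsen age to 10-year bands (assumption).
anonymized = original.copy()
anonymized["age"] = (anonymized["age"] // 10) * 10 + 5

covs = ["age", "male"]
deviation = (adjusted_odds_ratios(original, "severe", covs)
             - adjusted_odds_ratios(anonymized, "severe", covs)).abs()
print("absolute OR deviations:", deviation.round(3).to_dict())
print("median / min / max:", deviation.median(), deviation.min(), deviation.max())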


Subject(s)
COVID-19 , Humans , Bias , Data Anonymization , Models, Theoretical , Privacy , Data Interpretation, Statistical , Datasets as Topic
3.
Geneva; World Health Organization; 2022. (WHO/2019-nCoV/Clinical/Pregnancy/Analytic_plan/2022.1).
in English | WHOIRIS | ID: gwh-361256
4.
Cien Saude Colet ; 25(suppl 1): 2487-2492, 2020 Jun.
Article in Portuguese, English | MEDLINE | ID: covidwho-1725055

ABSTRACT

Data has become increasingly important and valuable for both scientists and health authorities searching for answers to the COVID-19 crisis. Owing to the difficulties of diagnosing this infection in populations around the world, initiatives supported by digital technologies are being developed by governments and private companies to enable the tracking of the public's symptoms, contacts and movements. In the current scenario, initiatives designed to support infection surveillance and monitoring are essential. Nonetheless, ethical, legal and technical questions abound regarding the amount and types of personal data being collected, processed, shared and used in the name of public health, as well as the concurrent or subsequent use of these data. These challenges demonstrate the need for new models of responsible and transparent data and technology governance in efforts to control SARS-CoV-2, as well as in future public health emergencies.




Subject(s)
Betacoronavirus , Coronavirus Infections/epidemiology , Global Health , Health Records, Personal , Pandemics , Pneumonia, Viral/epidemiology , Population Surveillance/methods , Privacy , COVID-19 , Confidentiality , Contact Tracing/methods , Data Anonymization , Humans , SARS-CoV-2 , Social Media
6.
PLoS One ; 17(1): e0262609, 2022.
Article in English | MEDLINE | ID: covidwho-1643269

ABSTRACT

BACKGROUND: The use of linked healthcare data in research has the potential to make major contributions to knowledge generation and service improvement. However, using healthcare data for secondary purposes raises legal and ethical concerns relating to confidentiality, privacy and data protection rights. Using a linkage and anonymisation approach that processes data lawfully and in line with ethical best practice to create an anonymous (non-personal) dataset can address these concerns, yet there is no established approach defining all of the steps involved in such a data flow end-to-end. We aimed to define such an approach with clear steps for dataset creation, and to describe its utilisation in a case study linking healthcare data. METHODS: We developed a data flow protocol that generates pseudonymous datasets that can be reversibly linked, or irreversibly linked to form an anonymous research dataset. It was designed and implemented by the Comprehensive Patient Records (CPR) study in Leeds, UK. RESULTS: We defined a clear approach that received ethico-legal approval for use in creating an anonymous research dataset. Our approach used individual-level linkage through a mechanism that is not computer-intensive and was rendered irreversible to both data providers and processors. We successfully applied it in the CPR study to hospital and general practice and community electronic health record data from two providers, along with patient-reported outcomes, for 365,193 patients. The resultant anonymous research dataset is available via DATA-CAN, the Health Data Research Hub for Cancer in the UK. CONCLUSIONS: Through ethical, legal and academic review, we believe we have contributed a defined approach and framework that exceeds current minimum standards for effective pseudonymisation and anonymisation. This paper describes our methods and provides supporting information to facilitate the use of this approach in research.
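As a non-authoritative sketch of one common way such a flow can be realized (keyed hashing of identifiers followed by key destruction), the Python fragment below derives deterministic pseudonyms so records from different providers link on the same value, then destroys the key so the mapping becomes irreversible. The names and identifiers are illustrative assumptions and not necessarily the CPR study's exact mechanism.

# Hedged sketch of a generic pseudonymisation flow (not necessarily the CPR
# study's mechanism): identifiers are replaced by keyed hashes so datasets from
# different providers can be linked on the same pseudonym; destroying the key
# afterwards makes the linkage irreversible for providers and processors.
import hashlib
import hmac
import secrets

def pseudonymize(identifier: str, key: bytes) -> str:
    """Derive a deterministic pseudonym from an identifier using a secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# A single project key is held while reversible linkage is still required...
project_key = secrets.token_bytes(32)

hospital_record = {"pid": pseudonymize("patient-0001", project_key), "dx": "C50"}
gp_record = {"pid": pseudonymize("patient-0001", project_key), "bmi": 27.4}

assert hospital_record["pid"] == gp_record["pid"]  # records link on the pseudonym

# ...then the key is destroyed, after which no party can recompute or reverse
# the mapping, yielding an effectively anonymous linked research dataset.
del project_key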


Subject(s)
Biomedical Research/methods , Confidentiality , Data Anonymization , Biomedical Research/ethics , Datasets as Topic , Electronic Data Processing/ethics , Electronic Data Processing/methods , Electronic Health Records/organization & administration , Humans , Information Storage and Retrieval , United Kingdom
8.
Geneva; World Health Organization; 2021. (WHO/2019-nCoV/Clinical_CRF/2021.1).
in English, Arabic, Russian, Chinese | WHOIRIS | ID: gwh-350007
9.
Geneva; World Health Organization; 2021. (WHO/2019-nCoV/Clinical/Analytic_plan/2021.1).
in English | WHOIRIS | ID: gwh-342435
10.
Public Health Rep ; 136(5): 554-561, 2021.
Article in English | MEDLINE | ID: covidwho-1277841

ABSTRACT

OBJECTIVES: Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently holds more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data. METHODS: We included data elements based on usefulness, public request, and privacy implications; we suppressed some field values to reduce the risk of re-identification and exposure of confidential information. We created the datasets and verified them for privacy and confidentiality by using data management platform analytic tools and R scripts. RESULTS: Unrestricted data are available to the public through Data.CDC.gov, and restricted data, with additional fields, are available with a data-use agreement through a private repository on GitHub.com. PRACTICE IMPLICATIONS: An enriched understanding of the available public data, the methods used to create these data, and the algorithms used to protect the privacy of de-identified individuals allows for improved data use. Automating data-generation procedures improves the volume and timeliness of data sharing.
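The following minimal pandas sketch illustrates one generic form of the value suppression described above: field values that occur fewer than an assumed threshold number of times are replaced before publication. It is an illustration only, not CDC's production algorithm, and all names and thresholds are assumptions.

# Illustrative sketch only (not CDC's production algorithm): suppress values of
# a field when they occur fewer than a threshold number of times, reducing the
# risk that rare combinations single out an individual. Names are assumptions.
import pandas as pd

SUPPRESSION_THRESHOLD = 5  # assumed minimum count before a value is retained

def suppress_rare_values(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace values whose frequency falls below the threshold."""
    counts = df[column].value_counts()
    rare = counts[counts < SUPPRESSION_THRESHOLD].index
    out = df.copy()
    out.loc[out[column].isin(rare), column] = "Suppressed"
    return out

cases = pd.DataFrame({
    "age_group": ["18-29"] * 40 + ["30-39"] * 25 + ["80+"] * 2,
    "county": ["A"] * 60 + ["B"] * 7,
})

public = suppress_rare_values(cases, "age_group")
print(public["age_group"].value_counts())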


Subject(s)
COVID-19/epidemiology , Centers for Disease Control and Prevention, U.S./organization & administration , Confidentiality/standards , Data Anonymization/standards , Centers for Disease Control and Prevention, U.S./standards , Humans , Pandemics , SARS-CoV-2 , United States/epidemiology
11.
Int J Health Geogr ; 20(1): 3, 2021 01 07.
Article in English | MEDLINE | ID: covidwho-1035104

ABSTRACT

BACKGROUND: Like many scientific fields, epidemiology is addressing issues of research reproducibility. Spatial epidemiology, which often uses the inherently identifiable variable of participant address, must balance reproducibility with participant privacy. In this study, we assess the impact of several different data perturbation methods on key spatial statistics and patient privacy. METHODS: We analyzed the impact of perturbation on spatial patterns in the full set of address-level mortality data from Lawrence, MA during the period from 1911 to 1913. The original death locations were perturbed using seven different published approaches to stochastic and deterministic spatial data anonymization. Key spatial descriptive statistics were calculated for each perturbation, including changes in spatial pattern center, Global Moran's I, Local Moran's I, distance to the k-th nearest neighbors, and the L-function (a normalized form of Ripley's K). A spatially adapted form of k-anonymity was used to measure the privacy protection conferred by each method and its compliance with HIPAA and GDPR privacy standards. RESULTS: Random perturbation at 50 m, donut masking between 5 and 50 m, and Voronoi masking maintained the validity of descriptive spatial statistics better than the other perturbations. Grid center masking with both 100 × 100 and 250 × 250 m cells led to large changes in descriptive spatial statistics. None of the perturbation methods adhered to the HIPAA standard that all points have a k-anonymity > 10; each method left at least 265 points, or over 6%, below this threshold. CONCLUSIONS: Using the set of published perturbation methods applied in this analysis, HIPAA- and GDPR-compliant de-identification was not compatible with maintaining key spatial patterns as measured by our chosen summary statistics. Further research should investigate alternative methods for balancing the tradeoff between spatial data privacy and the preservation of key patterns in public health data that are of scientific and medical importance.
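For readers unfamiliar with the geomasking techniques compared here, the short Python sketch below shows donut masking in its generic form: each point is displaced by a random bearing and a random distance between an inner and an outer radius. The radii and coordinates are assumptions, not the study's exact configuration.

# Minimal sketch of generic donut masking (parameters are assumptions, not the
# paper's exact configuration): each point is displaced by a random bearing and
# a random distance inside a 5-50 m annulus, so it never remains at, or too
# close to, its true location.
import math
import random

def donut_mask(x: float, y: float, r_min: float = 5.0, r_max: float = 50.0):
    """Displace a projected coordinate (in meters) into an r_min-r_max annulus."""
    angle = random.uniform(0.0, 2.0 * math.pi)
    # Sample the radius so points are uniform over the annulus area, rather
    # than biased toward the inner ring.
    radius = math.sqrt(random.uniform(r_min ** 2, r_max ** 2))
    return x + radius * math.cos(angle), y + radius * math.sin(angle)

random.seed(42)
for mx, my in (donut_mask(332500.0, 4713250.0) for _ in range(3)):
    print(round(mx, 1), round(my, 1))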


Subject(s)
Data Anonymization , Privacy , Cluster Analysis , Confidentiality , Humans , Reproducibility of Results
13.
Sci Data ; 7(1): 435, 2020 12 10.
Article in English | MEDLINE | ID: covidwho-972239

ABSTRACT

The Lean European Open Survey on SARS-CoV-2 Infected Patients (LEOSS) is a European registry for studying the epidemiology and clinical course of COVID-19. To support evidence generation at the rapid pace required in a pandemic, LEOSS follows an Open Science approach, making data available to the public in real time. To protect patient privacy, quantitative anonymization procedures are applied to the continuously published data stream, which consists of 16 variables on the course and therapy of COVID-19, to guard against singling-out, inference, and linkage attacks. We investigated the bias introduced by this process and found that it has very little impact on the quality of the output data. Current laws do not specify requirements for the application of formal anonymization methods, guidelines with clear recommendations are lacking, and few real-world applications of quantitative anonymization procedures have been described in the literature. We therefore believe that our work can help others develop urgently needed anonymization pipelines for their projects.
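As a hedged illustration of how singling-out risk can be quantified for a published record set, the pandas sketch below computes the smallest equivalence class (worst-case k) over a set of quasi-identifiers. The variable names and data are assumptions, not the LEOSS pipeline.

# Illustrative sketch (not the LEOSS pipeline): measure exposure to
# singling-out by computing the equivalence-class size (k) for each
# combination of quasi-identifiers. Names and data are assumptions.
import pandas as pd

QUASI_IDENTIFIERS = ["age_group", "gender", "month_of_diagnosis"]

def smallest_equivalence_class(df: pd.DataFrame, quasi_ids) -> int:
    """Return the minimum group size over all quasi-identifier combinations."""
    return int(df.groupby(list(quasi_ids)).size().min())

records = pd.DataFrame({
    "age_group": ["46-55", "46-55", "66-75", "66-75", "66-75"],
    "gender": ["f", "f", "m", "m", "m"],
    "month_of_diagnosis": ["2020-03"] * 5,
    "outcome": ["recovered", "recovered", "deceased", "recovered", "recovered"],
})

k = smallest_equivalence_class(records, QUASI_IDENTIFIERS)
print(f"worst-case k = {k}")  # a small k signals a high singling-out risk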


Subject(s)
COVID-19/epidemiology , Data Anonymization , Pandemics , Registries , Adult , Aged , Aged, 80 and over , Biomedical Research , Confidentiality , Datasets as Topic , Female , Humans , Male , Middle Aged
17.
Geneva; World Health Organization; 2020. (WHO/2019-nCoV/Clinical_CRF/2020.4).
in English, Arabic, Russian | WHOIRIS | ID: gwh-333229
18.
Genève; Organisation mondiale de la Santé; 2020. (WHO/2019-nCoV/Clinical_CRF/2020.3).
in French | WHOIRIS | ID: gwh-331794
19.
Ginebra; Organización Mundial de la Salud; 2020. (WHO/2019-nCoV/Clinical_CRF/2020.3).
in Spanish | WHOIRIS | ID: gwh-331793